Chapter 9 - Decision trees
Chapter 10 - Bagging

Hands-on Machine Learning with R
Bookclub R-Ladies Utrecht and R-Ladies Den Bosch

Let’s get started :)

  • Organized by @RLadiesUtrecht and @RLadiesDenBosch
  • Meet-ups every 2 weeks on “Hands-On Machine Learning with R”
    by Bradley Boehmke and Brandon Greenwell
  • No session recording, but we will publish the slides and notes
  • We use HackMD for making shared notes and for the registry:
    HackMD notes chapters 9 & 10
  • Please keep your mic off during the presentation. Cameras on and participation are appreciated to make the meeting more interactive.
  • Questions? Raise hand / write question in HackMD or in chat
  • Remember that presenters are not necessarily subject experts
  • Remember the R-Ladies code of conduct.
    In summary, please be nice to each other and help us make an inclusive meeting!

What did we discuss so far?

  • Intro (Chapter 1 & 2) - Gerbrich Ferdinands

  • Feature & Target Engineering (Chapter 3) - Ale Segura

  • Linear & Logistic regression (Chapter 4 & 5) - Martine Jansen

  • Regularized regression (Chapter 6) - Marianna Sebő

  • MARS & K-nearest neighbors (Chapter 7 & 8) - Elena Dudukina

Chapter 9 - Decision trees

Decision trees

  • Make predictions by asking simple questions about features

  • Non-parametric, similar responses are grouped by splitting rules

  • Easy to interpret and visualize with tree diagrams

  • Downside: often perform worse than more complex algorithms

Terminology

CART

  • Classification and regression tree

  • Data is partitioned into similar subgroups

  • Each subgroup (or node) is created by asking simple yes/no questions about each feature (e.g., is age < 18?)

  • This is repeated a number of times, until a stopping criterion is reached (e.g., maximum depth)

Regression versus classification

Regression trees predict the mean response value of the training observations in a terminal node; classification trees predict the most common class among them
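The difference comes down to the `method` argument in {rpart}. A minimal sketch using built-in datasets (mtcars and iris stand in for the book's examples):

```r
library(rpart)

# Regression tree: predicts the mean mpg in each terminal node (method = "anova")
reg_tree <- rpart(mpg ~ ., data = mtcars, method = "anova",
                  control = rpart.control(minsplit = 5))

# Classification tree: predicts the most common class in each node (method = "class")
cls_tree <- rpart(Species ~ ., data = iris, method = "class")

# Regression predictions are numeric node means; classification gives class labels
head(predict(reg_tree))
head(predict(cls_tree, type = "class"))
```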

Partitioning

  • Binary recursive partitioning

  • Objective at each node: find the “best” feature/split combination

  • The splitting process is then repeated in each of the two regions

  • Features can be used multiple times in the same tree

How deep?

Preventing overfitting

  • Restrict tree depth
  • Restrict minimum number of observations in terminal nodes
  • Pruning: make a complex tree first, simplify afterwards
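All three levers are available through `rpart.control()` and `prune()`. A sketch on mtcars (parameter values are illustrative, not recommendations):

```r
library(rpart)

# Restrict depth and terminal-node size up front
shallow <- rpart(mpg ~ ., data = mtcars, method = "anova",
                 control = rpart.control(maxdepth = 3,    # limit tree depth
                                         minbucket = 5))  # min obs per terminal node

# Pruning: grow a complex tree first (cp = 0), simplify afterwards
deep <- rpart(mpg ~ ., data = mtcars, method = "anova",
              control = rpart.control(cp = 0, minsplit = 2))

# Pick the cp value with the lowest cross-validated error and prune back to it
best_cp <- deep$cptable[which.min(deep$cptable[, "xerror"]), "CP"]
pruned  <- prune(deep, cp = best_cp)
```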

Bias and variance

Prerequisites

# Helper packages
library(dplyr)       # for data wrangling
library(ggplot2)     # for awesome plotting

# Modeling packages
library(rsample)     # for sampling the data 
library(rpart)       # direct engine for decision tree application
library(caret)       # meta engine for decision tree application
library(ipred)       # bagging

# Model interpretability packages
library(rpart.plot)  # for plotting decision trees
library(vip)         # for feature importance
library(pdp)         # for feature effects


ames <- AmesHousing::make_ames()
set.seed(123)
split <- initial_split(ames, prop = 0.7,
                       strata = "Sale_Price")
ames_train  <- training(split)
ames_test   <- testing(split)

Ames housing example

ames_dt1 <- rpart(
  formula = Sale_Price ~ .,
  data    = ames_train,
  method  = "anova")

Feature interpretation

Automated feature selection: uninformative features are not used in the model.
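The book's plots use {vip}, but {rpart} itself stores importance scores on the fitted object; features that never appear in a split (or as a surrogate) are simply absent. A minimal sketch on mtcars:

```r
library(rpart)

fit <- rpart(mpg ~ ., data = mtcars, method = "anova",
             control = rpart.control(minsplit = 5))

# Importance scores for the features actually used by the tree;
# uninformative features do not show up at all
sort(fit$variable.importance, decreasing = TRUE)
```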

Feature interpretation

A real-world example

Decision trees: pros

  • Easy to explain, visually appealing

  • Require little preprocessing

  • Robust to outliers and can handle missing data

  • Can handle mix of categorical and numeric features

Decision trees: cons

  • Not the best predictors (other models we’ve seen so far are better at predicting)

  • Simple yes/no questions result in rigid, non-smooth boundaries

  • Deep trees: low bias, high variance (risk of overfitting)

  • Shallow trees: high bias, low variance (poor predictive power)

Chapter 10 - Bagging

Bagging: bootstrap aggregating

  • Fit multiple prediction models and take the average

  • By model averaging, bagging helps to reduce variance and minimize overfitting

  • Especially useful for unstable, high variance models (where predicted output undergoes major changes in response to small changes in the training data)

Bagging - method

  • Create b bootstrap copies of the original training data

    • Bootstrapping: make new training sets by taking random samples with replacement
  • Fit your algorithm (commonly referred to as the base learner) to each bootstrap sample

  • New predictions are made by averaging predictions of the individual base learners
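The three steps above can be sketched by hand in base R plus {rpart}, using mtcars as a stand-in training set (b = 25 is arbitrary here):

```r
library(rpart)

set.seed(123)
b <- 25                                   # number of bootstrap copies
n <- nrow(mtcars)

# 1. Create b bootstrap copies; 2. fit the base learner (a deep tree) to each
models <- lapply(seq_len(b), function(i) {
  boot <- mtcars[sample(n, n, replace = TRUE), ]
  rpart(mpg ~ ., data = boot, method = "anova",
        control = rpart.control(minsplit = 2, cp = 0))
})

# 3. New predictions = average of the individual base learners' predictions
preds <- sapply(models, predict, newdata = mtcars)
bagged_pred <- rowMeans(preds)
```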

Bagging examples

Bagging 50–500 decision trees is typically sufficient; performance plateaus beyond that

Implementation

  • A single pruned decision tree performs worse than MARS or KNN

  • 100 unpruned, bagged decision trees perform better

  • Depending on number of iterations, this can become computationally intense

    • But: iterations are independent so easy to parallelize
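Because the iterations are independent, they can be farmed out to workers, e.g. with the base {parallel} package. A sketch (2 workers and 20 trees are arbitrary choices for illustration):

```r
library(parallel)
library(rpart)

cl <- makeCluster(2)                      # two workers; adjust to your cores
clusterEvalQ(cl, library(rpart))          # load the base learner on each worker

# Each worker independently bootstraps and fits its own deep trees
models <- parLapply(cl, 1:20, function(i) {
  boot <- mtcars[sample(nrow(mtcars), replace = TRUE), ]
  rpart(mpg ~ ., data = boot, method = "anova",
        control = rpart.control(minsplit = 2, cp = 0))
})
stopCluster(cl)
```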

Packages: {ipred} and {caret}

# make bootstrapping reproducible
set.seed(123)

# train bagged model (ipred package)
ames_bag1 <- ipred::bagging(
  formula = Sale_Price ~ .,
  data = ames_train,
  nbagg = 100,  
  coob = TRUE,
  control = rpart.control(minsplit = 2, cp = 0)
)
ames_bag2 <- caret::train(
  Sale_Price ~ .,
  data = ames_train,
  method = "treebag",
  trControl = trainControl(method = "cv", number = 10),
  nbagg = 200,  
  control = rpart.control(minsplit = 2, cp = 0)
)
ames_bag2

How many trees?

Feature importance

Feature importance

Bagging: pros and cons

Pro

  • Bagging improves prediction accuracy for high variance (and low bias) models

    • but at the expense of interpretability and computational speed

Con

  • Tree correlation: despite bootstrapping, many trees will be similar (esp. at the top)

Thanks!

We’re still looking for presenters, so let us know if you’re interested :)